We present a multi-modal dialogue system for interactive learning of perceptually grounded word meanings from a human tutor. The system integrates an incremental, semantic parsing/generation framework - Dynamic Syntax and Type Theory with Records (DS-TTR) - with a set of visual classifiers that are learned throughout the interaction and which ground the meaning representations that it produces. We use this system in interaction with a simulated human tutor to study the effects of different dialogue policies and capabilities on the accuracy of learned meanings, learning rates, and efforts/costs to the tutor. We show that the overall performance of the learning agent is affected by (1) who takes initiative in the dialogues; (2) the ability to express/use their confidence level about visual attributes; and (3) the ability to process elliptical and incrementally constructed dialogue turns. Ultimately, we train an adaptive dialogue policy which optimises the trade-off between classifier accuracy and tutoring costs.